Sequencing and Raw Sequence Data Quality Control ◾ 29
content across the reads in a FASTQ file and then compares it to a theoretical normal
distribution of the GC content, which is estimated from the observed data. If there is no
sequencing bias and the library is random, we expect the observed distribution of
the GC content of the reads to be approximately normal and roughly similar to the
theoretical distribution, whose central peak corresponds to the overall GC content of the
underlying genome. Deviation of the per sequence GC content distribution from the
normal distribution is an indication of a contaminated library or a fault in the sequencing
process. However, a bell-shaped normal curve that is shifted from the theoretical curve may or
may not indicate a bias. In this case, the observed distribution may represent
the actual GC distribution of the genome of the organism; therefore, no warning is
issued.
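The comparison described above can be illustrated with a minimal Python sketch. It is not the implementation used by any particular quality control program. The function names `gc_percent` and `gc_deviation` are hypothetical, reads are passed in as plain strings rather than parsed from a FASTQ file, and centering the fitted curve on the mean of the observed values is a simplifying assumption.

```python
import math
from collections import Counter


def gc_percent(seq):
    """Return the GC content of one read as a rounded integer percentage."""
    seq = seq.upper()
    if not seq:
        return 0
    gc = sum(1 for base in seq if base in "GC")
    return round(100 * gc / len(seq))


def gc_deviation(reads):
    """Compare the observed per-read GC histogram with a fitted normal curve.

    Returns the summed absolute difference between observed and expected
    read counts over the bins 0..100, divided by the number of reads.
    Values near 0 mean the observed distribution is close to normal.
    """
    values = [gc_percent(read) for read in reads]
    n = len(values)
    if n == 0:
        raise ValueError("no reads supplied")

    # Assumption: the normal curve is estimated from the observed data
    # using the sample mean and standard deviation.
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("all reads have the same GC content; cannot fit a curve")

    observed = Counter(values)
    deviation = 0.0
    for x in range(101):
        density = math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        expected = n * density  # expected number of reads in this 1% bin
        deviation += abs(observed.get(x, 0) - expected)
    return deviation / n
```

A library whose reads cluster around a single GC value gives a small result, whereas a mixture of two populations with different GC content, as might arise from contamination, gives a much larger one. Because the curve is fitted to the observed data, a bell-shaped distribution that is merely shifted still scores close to zero, which matches the behavior described above.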
FIGURE 1.19 Normal and abnormal per base sequence content.
FIGURE 1.20 Normal and abnormal per sequence GC content.